Exploratory Component

In the introduction, we mentioned that in the exploratory component, where we have to address our own question and answer it through the data science cycle, we chose to deal with two different things. The first is extending the dataset with weather data and checking whether it improves our models' performance. The second, and most difficult, is predicting not the overall pickups but, by clustering Boston's stations, the rides per cluster.

Part 1. Prediction With Data From Other Sources

Feature Engineering And Analysis

Pickups Data

Weather Data

The weather data was downloaded from https://www.visualcrossing.com/weather-data using the Easy Global Weather API. It consists of hourly observations from Boston weather stations, also covering 01/01/2022 through 31/08/2022. Since it contains many different columns, it will definitely need to be preprocessed.
Let's start then!

Data Cleaning

At first glance, we drop the columns we think are not very helpful, while the weather columns we think are helpful will be processed instead. Later we will probably drop a few more of them based on the knowledge we acquire through cleaning and analysis.
We check for NaN values, and after that we start examining each column one by one.
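As a minimal sketch of this step, assuming a pandas workflow (the column names and values below are illustrative placeholders, not the real dataset's):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the raw weather download.
weather = pd.DataFrame({
    "datetime": pd.date_range("2022-01-01", periods=4, freq="H"),
    "temp": [1.2, np.nan, 0.5, -0.3],
    "stations": ["KBOS", "KBOS", None, "KBOS"],  # a column judged unhelpful
})

# Count missing values per column before deciding what to clean.
print(weather.isna().sum())

# Drop the columns we consider not helpful, then inspect the rest one by one.
weather = weather.drop(columns=["stations"])
print(weather.columns.tolist())
```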

temp

feelslike

They seem very similar except for the upper and lower bounds, which is very reasonable, as in extreme temperatures the feel is even worse than the observed value.

humidity

precip

preciptype

snow

windspeed

conditions

severerisk

solarradiation

uvindex

Data Resample In Intervals And Merging

Weekend/Weekday & Holidays & Month/Hour/Minute & Season
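A sketch of how such calendar features could be derived from a datetime index (the holiday set and the season mapping below are illustrative assumptions, not the project's actual ones):

```python
import pandas as pd

# Toy datetime index standing in for the resampled intervals.
idx = pd.date_range("2022-01-01", "2022-01-03", freq="6H")
df = pd.DataFrame(index=idx)

df["month"] = df.index.month
df["hour"] = df.index.hour
df["weekday"] = df.index.dayofweek            # 0 = Monday ... 6 = Sunday
df["is_weekend"] = (df["weekday"] >= 5).astype(int)

holidays = {pd.Timestamp("2022-01-01")}       # placeholder holiday list
df["is_holiday"] = df.index.normalize().isin(holidays).astype(int)

# Simple meteorological seasons: Dec-Feb = 0 (winter), ..., Sep-Nov = 3.
df["season"] = df["month"].map(lambda m: (m % 12) // 3)
```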

Final Dataset for EDA

Extra Information

The columns we have kept so far are:

By creating a correlation heatmap, at first glance we are able to identify the following:
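A hedged sketch of the correlation step on synthetic stand-in columns (the data below is fabricated for illustration; in the notebook the heatmap would be drawn from the real merged dataset, e.g. with seaborn):

```python
import numpy as np
import pandas as pd

# Synthetic columns that only mimic the described features.
rng = np.random.default_rng(0)
n = 200
temp = rng.normal(15, 8, n)
df = pd.DataFrame({
    "temp": temp,
    "feelslike": temp + rng.normal(0, 1, n),     # nearly identical to temp
    "humidity": rng.uniform(20, 100, n),
    "pickups": 50 + 3 * temp + rng.normal(0, 10, n),
})

corr = df.corr()
print(corr.round(2))

# The heatmap itself would typically be:
# import seaborn as sns; sns.heatmap(corr, annot=True)
```

Strongly correlated pairs like temp/feelslike are candidates for dropping one of the two.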

Prediction

We drop the last columns we concluded are not critical and convert the categorical ones to the category dtype, ending up with the following columns:
As before, the best algorithms are the non-linear ensemble models, and they are also slightly better than before. The backtesting strategy confirms our prediction, and the small overfitting remains. The most important features are definitely still the lags of the intervals, for both linear and non-linear models, with the hour following. However, the linear models seem to be more sensitive to weather columns such as uvindex, humidity, and conditions. Another inference is that this sensitivity is smoother in the smaller time intervals than in the bigger ones. All in all, the champion model is again the ExtraTreesRegressor, with the lowest RMSE and highest score; the linear models are better at capturing the trend in the smaller intervals, while the ensembles are slightly better on the bigger ones.
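A sketch of such a linear-vs-ensemble comparison on synthetic data (the features below only mimic the described lag, hour, and humidity columns; the real pipeline's data, features, and hyperparameters differ):

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Fabricated features: a pickup lag, the hour of day, and humidity.
rng = np.random.default_rng(42)
n = 1000
lag1 = rng.poisson(30, n).astype(float)
hour = rng.integers(0, 24, n)
humidity = rng.uniform(20, 100, n)
y = 0.8 * lag1 + 5 * np.sin(hour / 24 * 2 * np.pi) - 0.05 * humidity \
    + rng.normal(0, 2, n)
X = np.column_stack([lag1, hour, humidity])

# Chronological-style split, as in a backtest.
X_train, X_test, y_train, y_test = X[:800], X[800:], y[:800], y[800:]

results = {}
for model in (LinearRegression(),
              ExtraTreesRegressor(n_estimators=100, random_state=0)):
    model.fit(X_train, y_train)
    rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
    results[type(model).__name__] = rmse
    print(type(model).__name__, round(rmse, 2))
```

The non-linear model can pick up the cyclical hour effect that a plain linear fit cannot.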

Part 2. Stations Clustering

And as we promised, let's start the most difficult part of the project: prediction at the station-cluster level. In this part we try to identify possible clusters among the bike stations and predict their rides. We won't dive too deep into it, just to leave something for the future :)
We use the Elbow Curve method to choose the number of station clusters: in cluster analysis, the elbow curve is a heuristic for determining the number of clusters in a data set. The method consists of plotting the explained variation as a function of the number of clusters and picking the elbow of the curve as the number of clusters to use.
Based on the elbow curve, we select 5 clusters, as this is approximately the start of the optimal range of 5-7 clusters.
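The elbow step can be sketched with KMeans inertia, assuming station coordinates as the clustering input (the coordinates below are synthetic, drawn around five made-up centres):

```python
import numpy as np
from sklearn.cluster import KMeans

# Fake station coordinates scattered around 5 centres.
rng = np.random.default_rng(1)
centres = rng.uniform(-1, 1, size=(5, 2))
coords = np.vstack([c + 0.05 * rng.normal(size=(40, 2)) for c in centres])

inertias = []
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(coords)
    inertias.append(km.inertia_)

# Inertia always decreases with k; the "elbow" is where the drop flattens,
# which is the k one would pick from the plot.
print([round(i, 2) for i in inertias])
```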
The number of rides per cluster.
First visualization of the clusters, without a layout.
The second and much better visualization of them, using the map of Massachusetts as the layout. For memory reasons we have used only a sample of the data. Zooming in on the city, the clustering looks good.
There is also a small cluster up in Salem which the plot could not interpret; this is probably because of the few data points we plotted.
We need to transform the data into a format we can apply our model to. Selecting only the 60-minute interval, we apply a new type of model, the Multi Output Regressor.
This is our dataset's final format: columns with each cluster's pickups occurring during the index datetime, plus lag columns with the pickups from two hours ago. Based on that, we create our train and test sets.
This time we use a special type of regressor, the MultiOutputRegressor from sklearn, which performs multi-target regression. This strategy consists of fitting one regressor per target and is a simple way of extending regressors that do not natively support multi-target regression. For the prediction we will test a linear and a non-linear model: LinearRegression() vs RandomForestRegressor().
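A self-contained sketch of this setup on fabricated pickup counts (the cluster column names and lag layout only mimic the described format; the real counts and split differ):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.multioutput import MultiOutputRegressor

# Fake hourly pickups per cluster, mimicking the described final format.
rng = np.random.default_rng(7)
idx = pd.date_range("2022-06-01", periods=300, freq="60min")
pickups = pd.DataFrame(
    rng.poisson([20, 15, 10, 5, 2], size=(300, 5)),
    index=idx,
    columns=[f"cluster_{i}" for i in range(5)],
)

# Lag features: the same counts one and two hours earlier.
lags = pd.concat([pickups.shift(1).add_suffix("_lag1"),
                  pickups.shift(2).add_suffix("_lag2")], axis=1).dropna()
y = pickups.loc[lags.index]

# Chronological split, then one regressor fitted per target cluster.
split = 240
model = MultiOutputRegressor(ExtraTreesRegressor(n_estimators=50,
                                                 random_state=0))
model.fit(lags.iloc[:split], y.iloc[:split])
pred = model.predict(lags.iloc[split:])
print(pred.shape)  # one prediction column per cluster
```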
In a nutshell, our prediction didn't make us much wiser, since we didn't identify a model with a good prediction score, although the 0.67195 R^2 of MultiOutputRegressor(ExtraTreesRegressor) meets the benchmark model we were asked for in the project's description. As we can see, the model is not able to follow the changes after a specific point where the rides rise a lot, and on top of that none of the models is capable of predicting cluster_4 at all, the cluster with the fewest rides, the smallest of all. To sum up, the models have potential if combined with more data, such as weather data and more observations, and with correct parametrization.